A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# Allows us to access packages to analyze data
import pandas as pd
# Helps us in analyzing and breaking down arrays.
import numpy as np
# To get different metric scores.
from sklearn import metrics
# Assists in displaying visualizations and graphs.
import matplotlib.pyplot as plt
%matplotlib inline
# Builds statistical visualizations on top of matplotlib.
import seaborn as sns
# Helps us in splitting the data into train and test datasets.
from sklearn.model_selection import train_test_split
# The two following imports allow us to use decision tree analysis.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To compute VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# To build linear models for statistical analysis and prediction
import statsmodels.stats.api as sms
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    plot_confusion_matrix,  # removed in scikit-learn 1.2; drop this name on newer versions
    make_scorer,
    roc_curve,
    roc_auc_score,
)
# Allows us to use the GridSearchCV function
from sklearn.model_selection import GridSearchCV
# We load the data provided by INN Hotels Group
data = pd.read_csv("INNHotelsGroup.csv")
# Shape of the data, which turns out to have 36275 rows and 19 columns.
data.shape
(36275, 19)
# We look at the first 5 rows of the dataset.
data.head()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
On the surface, we see that these bookings consist of one or two adults with no children, but five rows might be misleading. We also see that most of these customers chose 'Meal Plan 1' (the breakfast option), that all of them reserved 'Room_Type 1', and that four of the five booked online.
data.describe(include = 'all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Booking_ID | 36275 | 36275 | INN00001 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| no_of_adults | 36275.0 | NaN | NaN | NaN | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.0 | 2.0 | 4.0 |
| no_of_children | 36275.0 | NaN | NaN | NaN | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | NaN | NaN | NaN | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.0 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | NaN | NaN | NaN | 2.2043 | 1.410905 | 0.0 | 1.0 | 2.0 | 3.0 | 17.0 |
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| required_car_parking_space | 36275.0 | NaN | NaN | NaN | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| lead_time | 36275.0 | NaN | NaN | NaN | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.0 | 126.0 | 443.0 |
| arrival_year | 36275.0 | NaN | NaN | NaN | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.0 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | NaN | NaN | NaN | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.0 | 10.0 | 12.0 |
| arrival_date | 36275.0 | NaN | NaN | NaN | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
| market_segment_type | 36275 | 5 | Online | 23214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| repeated_guest | 36275.0 | NaN | NaN | NaN | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| no_of_previous_cancellations | 36275.0 | NaN | NaN | NaN | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | NaN | NaN | NaN | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.0 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | NaN | NaN | NaN | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | NaN | NaN | NaN | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
From this summary, we see that the mean number of adults per booking is about 1.8, with a median of 2. Children rarely stay in the hotel (the mean is about 0.1 per booking). Most customers do not require a parking space (only about 3% do), and the average booking includes about two week nights.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
We see that the columns containing object data are Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status. It is perfectly fine for these columns to hold object data, since they are identifiers or categorical labels. We also see that avg_price_per_room is a float, which is logical because prices include fractional amounts. The other columns hold integers.
Leading Questions:
# To help answer some of these questions, we define a function
# named 'labeled_barplot', which takes in the data and the feature we specify,
# and gives us a barplot with a count or percentage label at the top of each bar.
def labeled_barplot(data, feature, perc=False, n=None):
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage
    plt.show()  # show the plot
# We also include a function called 'stacked_barplot', which takes in the
# data, a predictor variable, and a target variable. The target variable in
# our case will be whether someone cancelled their booking or not.
def stacked_barplot(data, predictor, target):
    count = data[predictor].nunique()
    # Sort by the least frequent target class (here, 'Canceled')
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    # Place the legend outside the plot area, to the right
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# To answer the first question regarding the busiest month in the hotel,
# we use the labeled_barplot() function, and specify that we want to look at the
# 'arrival_month' column of the data and we also want to see the percentages of
# each value. Thus we can see how much the data consists of bookings in September
# ('9') or in other months.
labeled_barplot(data, 'arrival_month', perc = True)
From this graph, we see that October is the busiest arrival month, followed closely by September and August. Interestingly, these months are consecutive and together account for 37.9% of the data. Bookings generally rise through the year until they peak in October and then fall in November. Thus, I would be curious whether marketing or some other factor encourages people to stay during these months.
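As a quick arithmetic check on the 37.9% figure, the sketch below recomputes the monthly shares from the per-month totals that appear in the `arrival_month` crosstab printed later in this notebook. The names `bookings_by_month` and `share_of` are illustrative and not part of the original analysis.

```python
# Total bookings per arrival month, copied from the "All" column of the
# arrival_month crosstab printed in this notebook.
bookings_by_month = {
    1: 1014, 2: 1704, 3: 2358, 4: 2736, 5: 2598, 6: 3203,
    7: 2920, 8: 3813, 9: 4611, 10: 5317, 11: 2980, 12: 3021,
}

total_bookings = sum(bookings_by_month.values())


def share_of(months):
    """Return the percentage of all bookings that fall in the given months."""
    return 100 * sum(bookings_by_month[m] for m in months) / total_bookings


# Month with the largest number of bookings.
busiest_month = max(bookings_by_month, key=bookings_by_month.get)
# Combined share of August, September and October.
peak_share = share_of([8, 9, 10])

print(f"Busiest month: {busiest_month}")
print(f"Share of August to October: {peak_share:.1f}%")
```

With the dataframe loaded, `data['arrival_month'].value_counts(normalize=True)` should give the same shares directly.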
# We apply a similar method in analyzing the proportion of
# types of Market segment.
labeled_barplot(data, 'market_segment_type', perc = True)
In the plot displayed, we see that a majority of customers (about 64%) book online. The next largest segment is offline bookings (about 29%), presumably cases where someone made a phone call or walked in and asked for a booking. Based on this, the hotel would do well to focus its marketing on the internet.
# Another question is whether room prices differ across market segments.
# To answer it, I use a boxplot with 'market_segment_type' on the x-axis and
# 'avg_price_per_room' on the y-axis.
sns.boxplot(data=data, x='market_segment_type', y='avg_price_per_room')
# Shows the graph with no coding messages.
plt.show()
From the above graph, we see that the median price for rooms booked online tends to be higher than for rooms booked offline or through a corporate account. This is notable given that a majority of bookings are made online. It could be that online customers book more expensive rooms or that they plan further ahead. We also see that corporate prices are lower than those of the other market types (except complementary), which may have to do with the low rates companies are charged or the room types booked on behalf of employees.
labeled_barplot(data, 'booking_status', perc = True)
We see from this graph that 67.2% of customers who make a booking do not cancel, while 32.8% do. Cancellations could stem from a number of factors, but the purpose of this analysis is to look at factors we can measure. Using the methods available to us, we can infer whether any variables have a positive or negative influence on that decision.
# We use the 'stacked_barplot()' function, using the 'repeated_guest' as the predictor and 'booking_status' as the target
stacked_barplot(data, 'repeated_guest', 'booking_status')
# Shows the graph with no coding messages.
plt.show()
booking_status  Canceled  Not_Canceled    All
repeated_guest
All                11885         24390  36275
0                  11869         23476  35345
1                     16           914    930
------------------------------------------------------------------------------------------------------------------------
What percentage of repeating guests cancel?
From this graph, we see that a repeated guest is significantly less likely to cancel a booking than a first-time customer: 16 of 930 repeated guests cancelled (about 1.7%), compared with 11,869 of 35,345 first-time guests (about 33.6%). This is helpful, as it suggests that it's not so much the service as something else that influences whether a person cancels their booking. However, a small minority of repeat guests still cancel, so there might be a service factor that we can identify and change to shrink that proportion.
# We use the 'stacked_barplot()' function, looking at 'no_of_special_requests' as the predictor and 'booking_status' as the
# target variable.
stacked_barplot(data, 'no_of_special_requests', 'booking_status')
booking_status          Canceled  Not_Canceled    All
no_of_special_requests
All                        11885         24390  36275
0                           8545         11232  19777
1                           2703          8670  11373
2                            637          3727   4364
3                              0           675    675
4                              0            78     78
5                              0             8      8
------------------------------------------------------------------------------------------------------------------------
From this graph, we see that as the number of special requests increases from 0 to 2, the proportion of bookings that are cancelled decreases (about 43.2%, 23.8%, and 14.6% respectively). With more than 2 special requests, there are no cancellations at all in this dataset. Thus, the number of special requests has an inverse relationship with cancellation.
# We use the 'stacked_barplot()' function, looking at 'arrival_month' as the predictor and 'booking_status' as the
# target variable.
stacked_barplot(data, 'arrival_month', 'booking_status')
booking_status  Canceled  Not_Canceled    All
arrival_month
All                11885         24390  36275
10                  1880          3437   5317
9                   1538          3073   4611
8                   1488          2325   3813
7                   1314          1606   2920
6                   1291          1912   3203
4                    995          1741   2736
5                    948          1650   2598
11                   875          2105   2980
3                    700          1658   2358
2                    430          1274   1704
12                   402          2619   3021
1                     24           990   1014
------------------------------------------------------------------------------------------------------------------------
From this graph, we see that January has the lowest proportion of cancellations (24 of 1,014 bookings, about 2.4%) and July the highest (1,314 of 2,920, about 45.0%). By raw count, October has the most cancellations (1,880), but its cancellation rate (about 35.4%) is close to the overall rate of 32.8%, which is interesting considering that October is the busiest month. We also see a large jump in the cancellation rate from January to February (about 2.4% to 25.2%).
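Because the stacked plot shows proportions while the printed table shows counts, "most cancellations" can mean two different months. The following sketch, using only the counts printed in the crosstab above, separates the two readings. The variable names are illustrative.

```python
# (canceled, not_canceled) per arrival month, copied from the crosstab above.
monthly = {
    1: (24, 990), 2: (430, 1274), 3: (700, 1658), 4: (995, 1741),
    5: (948, 1650), 6: (1291, 1912), 7: (1314, 1606), 8: (1488, 2325),
    9: (1538, 3073), 10: (1880, 3437), 11: (875, 2105), 12: (402, 2619),
}

# Cancellation rate (%) per month: canceled / all bookings in that month.
cancel_rate = {
    month: 100 * canceled / (canceled + kept)
    for month, (canceled, kept) in monthly.items()
}

# Month with the most cancellations in absolute terms.
most_by_count = max(monthly, key=lambda month: monthly[month][0])
# Months with the highest and lowest cancellation rate.
most_by_rate = max(cancel_rate, key=cancel_rate.get)
least_by_rate = min(cancel_rate, key=cancel_rate.get)

print(f"Most cancellations by count: month {most_by_count}")
print(f"Highest cancellation rate:   month {most_by_rate}")
print(f"Lowest cancellation rate:    month {least_by_rate}")
```

Comparing rates rather than counts keeps busy months from looking worse simply because they have more bookings.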
# We look to see the unique values in the 'type_of_meal_plan'
data['type_of_meal_plan'].unique()
# We look to see the unique values in the 'room_type_reserved'
data['room_type_reserved'].unique()
array(['Room_Type 1', 'Room_Type 4', 'Room_Type 2', 'Room_Type 6',
'Room_Type 5', 'Room_Type 7', 'Room_Type 3'], dtype=object)
# We make a copy of the data to avoid making direct changes to the read-in data
# Which we name df1
df1 = data.copy()
# We check to see if there are any duplicate rows in the dataset.
df1.duplicated().sum()
0
# We check to see if there are any missing values present in each column.
# Surprisingly, we don't see any missing values.
df1.isnull().sum()
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
# We use this for loop to convert 'object' data type columns into category type
# variables.
for feature in df1.columns:  # Loop through all columns in the dataframe
    if df1[feature].dtype == 'object':  # Only apply to columns with categorical strings
        df1[feature] = pd.Categorical(df1[feature])  # Convert strings to the category dtype
# Next we look at the count of each unique value in the 'type_of_meal_plan' column.
print(df1.type_of_meal_plan.value_counts())
# We look at the count of each unique value in the 'room_type_reserved' column.
print(df1.room_type_reserved.value_counts())
# We look at the count of each unique value in the 'market_segment_type' column.
print(df1.market_segment_type.value_counts())
Meal Plan 1     27835
Not Selected     5130
Meal Plan 2      3305
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64
Room_Type 1    28130
Room_Type 4     6057
Room_Type 6      966
Room_Type 2      692
Room_Type 5      265
Room_Type 7      158
Room_Type 3        7
Name: room_type_reserved, dtype: int64
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: market_segment_type, dtype: int64
# Given what we know, we restructure the categorical columns to
# hold numeric values instead.
# We create a dictionary that maps certain column values to numeric values.
replaceStruct = {
    # Each meal plan maps to its respective plan number; where
    # a meal plan was not selected, we use -1 to make it clear that this
    # is a negative selection.
    "type_of_meal_plan": {"Meal Plan 1": 1, "Meal Plan 2": 2, "Meal Plan 3": 3, "Not Selected": -1},
    "room_type_reserved": {"Room_Type 1": 1, "Room_Type 2": 2, "Room_Type 3": 3, "Room_Type 4": 4,
                           "Room_Type 5": 5, "Room_Type 6": 6, "Room_Type 7": 7},
    "market_segment_type": {"Online": 1, "Offline": 2, "Corporate": 3, "Complementary": 4, "Aviation": 5},
    "booking_status": {'Not_Canceled': 1, 'Canceled': 0}
}
# We make a df2 dataset, where we replace the categorical values with numerical values we specified.
df2 = df1.replace(replaceStruct)
# We convert the 'type_of_meal_plan' column into an integer data type.
df2['type_of_meal_plan'] = df2['type_of_meal_plan'].astype(int)
# We convert the 'room_type_reserved' column into an integer data type.
df2['room_type_reserved'] = df2['room_type_reserved'].astype(int)
# We convert the 'market_segment_type' column into an integer data type.
df2['market_segment_type'] = df2['market_segment_type'].astype(int)
# We convert the 'booking_status' column into an integer data type.
df2['booking_status'] = df2['booking_status'].astype(int)
# Displays information on 'df2', confirming that all columns except Booking_ID are numeric.
df2.info()
# Drops any rows that contain missing ("NA") values.
df2 = df2.dropna()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  category
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  int32
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  int32
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  int32
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  int32
dtypes: category(1), float64(1), int32(4), int64(13)
memory usage: 5.9 MB
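Mapping nominal categories such as market segment to the integers 1 to 5 implicitly tells a linear model that the categories are ordered and evenly spaced. A common alternative, shown here only as a sketch and not as the approach this notebook takes, is one-hot encoding with `pd.get_dummies`. The small `toy` dataframe is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; the values are not taken from the dataset.
toy = pd.DataFrame(
    {
        "market_segment_type": ["Online", "Offline", "Corporate", "Online"],
        "lead_time": [5, 224, 30, 48],
    }
)

# One indicator column per category. drop_first=True drops the first
# category in sorted order ("Corporate"), which becomes the reference level.
encoded = pd.get_dummies(
    toy, columns=["market_segment_type"], drop_first=True, dtype=int
)

print(encoded)
```

With this encoding, each segment gets its own coefficient relative to the reference level, at the cost of more columns in the model.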
# We split the data, so we can prepare it for further analysis.
# We create an 'X' variable that takes the df2 dataset and drop the 'booking_status' and 'Booking_ID' columns.
# We drop the 'Booking_ID' column as it's just an identifier and we drop 'booking_status' as we're trying to predict that.
X = df2.drop(['booking_status', 'Booking_ID'], axis = 1)
# We create Y to be the 'booking_status' column.
Y = df2['booking_status']
# adding a constant to X variable
X = add_constant(X)
# creating dummy variables, in case we missed something.
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
# While we could look at each distribution by making separate graphs and analyzing each one individually, it's admittedly a
# bit much to do when we have 19 columns. So, we're going to look at them together, to make it simpler.
# We use seaborn's "darkgrid" aesthetic theme.
sns.set_style("darkgrid")
# We then use the df2 dataset to make a series of histograms, with the size specified.
df2.hist(figsize=(15, 10))
# We display the graph with no coding messages.
plt.show()
Admittedly, this doesn't cover all the columns of the dataset (we're missing 'Booking_ID', which isn't numeric). However, this series of graphs does tell us that most variables are not normally distributed. Some graphs, like 'repeated_guest', 'no_of_previous_cancellations', and 'no_of_previous_bookings_not_canceled', appear to have only one bar because nearly all of their values are zero. Now, to break down the graph insights bit by bit.
# We now move into bivariate analysis, looking to see whether a variable correlates with another variable.
# We specify the size of the graph we're about to make
plt.figure(figsize=(15, 7))
# We make a heatmap of the df2 dataset, with annotations. We set the color scale to run from -1 to 1 so we can see
# positive and negative correlations. We drop the non-numeric 'Booking_ID' column before computing correlations.
sns.heatmap(df2.drop(columns="Booking_ID").corr(), annot=True, vmin=-1, vmax=1)
# Displays the graph with no coding messages.
plt.show()
# We make a pair plot of the 'df2' dataset. pairplot creates its own figure,
# so no separate plt.figure() call is needed.
sns.pairplot(df2, hue="booking_status", kind="reg")
# Shows the graph with no coding messages.
plt.show()
Admittedly, the pair plot is hard to make out. However, on closer study, we see some interesting correlations.
# We compute the VIF score of each predictor variable and store the
# scores in a pandas Series indexed by column name.
vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
# We print the calculated VIF for each variable.
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

const                                   3.754992e+07
no_of_adults                            1.291706e+00
no_of_children                          1.266959e+00
no_of_weekend_nights                    1.060873e+00
no_of_week_nights                       1.082720e+00
type_of_meal_plan                       1.312574e+00
required_car_parking_space              1.029356e+00
room_type_reserved                      1.497096e+00
lead_time                               1.274362e+00
arrival_year                            1.367498e+00
arrival_month                           1.265919e+00
arrival_date                            1.005103e+00
market_segment_type                     1.649497e+00
repeated_guest                          1.676494e+00
no_of_previous_cancellations            1.358922e+00
no_of_previous_bookings_not_canceled    1.625360e+00
avg_price_per_room                      1.730931e+00
no_of_special_requests                  1.195382e+00
dtype: float64
We see that the only high VIF value is for the constant, which is not a predictor and can be disregarded; every other variable has a VIF between 1 and 2. This indicates low multicollinearity, meaning that none of the predictors is well explained by a linear combination of the others. We can therefore perform our logistic regression with confidence that this assumption is satisfied.
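To make the VIF numbers more concrete, here is a minimal NumPy sketch of the usual definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The helper name `vif` and the synthetic columns `a`, `b`, and `c` are assumptions made for illustration; this is not the statsmodels implementation used above.

```python
import numpy as np


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept so R^2 is measured around the mean of y.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)            # independent of a
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a

X_toy = np.column_stack([a, b, c])
vifs = [vif(X_toy, j) for j in range(X_toy.shape[1])]

for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```

The nearly duplicated columns `a` and `c` get large VIFs while the independent column `b` stays close to 1, which is the pattern the predictors in this notebook show.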
# We fit the model on the training set
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit()
Optimization terminated successfully.
Current function value: 0.434582
Iterations 12
# We look at the summary of the logistic regression model.
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25374
Method: MLE Df Model: 17
Date: Fri, 17 Mar 2023 Pseudo R-squ.: 0.3129
Time: 17:35:23 Log-Likelihood: -11035.
converged: True LL-Null: -16060.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 1067.0271 115.529 9.236 0.000 840.595 1293.459
no_of_adults -0.0365 0.036 -1.002 0.317 -0.108 0.035
no_of_children 0.0348 0.046 0.750 0.453 -0.056 0.126
no_of_weekend_nights -0.1479 0.019 -7.591 0.000 -0.186 -0.110
no_of_week_nights -0.0593 0.012 -4.935 0.000 -0.083 -0.036
type_of_meal_plan 0.1126 0.024 4.723 0.000 0.066 0.159
required_car_parking_space 1.4987 0.136 11.018 0.000 1.232 1.765
room_type_reserved 0.0756 0.015 4.987 0.000 0.046 0.105
lead_time -0.0141 0.000 -57.666 0.000 -0.015 -0.014
arrival_year -0.5281 0.057 -9.225 0.000 -0.640 -0.416
arrival_month 0.0389 0.006 6.153 0.000 0.027 0.051
arrival_date -0.0018 0.002 -0.926 0.354 -0.005 0.002
market_segment_type 1.2156 0.043 28.420 0.000 1.132 1.299
repeated_guest 1.2479 0.478 2.613 0.009 0.312 2.184
no_of_previous_cancellations -0.1303 0.076 -1.714 0.087 -0.279 0.019
no_of_previous_bookings_not_canceled 0.0258 0.070 0.368 0.713 -0.112 0.163
avg_price_per_room -0.0180 0.001 -25.732 0.000 -0.019 -0.017
no_of_special_requests 1.3862 0.029 47.544 0.000 1.329 1.443
========================================================================================================
We have coefficients for the variables, but we need to keep in mind that they are on the log-odds scale and, because Not_Canceled is coded as 1, they describe the odds of a booking not being cancelled. We also see that most of the variables are significant, except for a handful: number of adults, number of children, arrival date, number of previous cancellations, and number of previous bookings not cancelled. This is reassuring in a way, because it suggests that the regular service at the hotel is not the driving force behind cancellations.
The null hypothesis for each coefficient is that there is no relationship between that independent variable and our dependent variable (booking status). Since the p-value for some of these variables exceeds our default significance level of 0.05, we will drop those variables later on.
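For reference, the model fitted above has the standard logistic form. Because `booking_status` was encoded with `Not_Canceled` as 1, $p$ here is the probability that a booking is not cancelled. The symbols $p$, $\beta_j$, $x_j$, and $k$ are notation introduced for this note only.

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j,
\qquad
p = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_j\right)}}
```

Each coefficient $\beta_j$ is therefore the change in the log-odds of not cancelling for a one-unit increase in $x_j$, holding the other predictors fixed. A negative coefficient, such as the one for `lead_time`, means that larger values lower the odds of not cancelling.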
# To see whether the model predicts well, we classify the training data using the default
# threshold of 0.5: an observation is classified as 1 when its predicted probability exceeds 0.5.
# We make a variable named 'pred_train' that flags the cases where the probability is greater than 0.5
# and convert the True/False flags to 1/0.
pred_train = (lg.predict(X_train) > 0.5).astype(int)
# We make a confusion matrix, using the y_train and pred_train datasets.
cm = confusion_matrix(y_train, pred_train)
# We specify the size of the confusion matrix.
plt.figure(figsize=(7, 5))
# We make a heatmap, using the 'cm' variable.
sns.heatmap(cm, annot=True, fmt="g")
# We specify the x-axis label to be 'Predicted Values'
plt.xlabel("Predicted Values")
# We label the y-axis to be 'Actual Values'
plt.ylabel("Actual Values")
# We display the confusion matrix without coding messages.
plt.show()
The confusion matrix covers the 25,392 training bookings, not all 36,275. Recall that 1 encodes Not_Canceled and 0 encodes Canceled. The model correctly identifies 5,134 cancelled bookings and 15,267 bookings that were not cancelled. It wrongly predicts a cancellation for 1,806 bookings that were kept, and the remaining 3,185 bookings (25,392 less the other three cells) were cancelled although the model predicted they would not be.
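The cell counts can be turned into the usual metrics with a few lines of arithmetic. This sketch assumes that the 1,806 cell is bookings that were kept but predicted as cancelled (the reading consistent with a stratified 70/30 split) and infers the fourth cell as the remainder of the 25,392 training rows; neither is read directly from the plot. The variable names are illustrative.

```python
# Cells of the training confusion matrix. With the encoding used in this
# notebook, class 1 = Not_Canceled ("kept") and class 0 = Canceled.
canceled_predicted_canceled = 5134   # reported in the text
kept_predicted_kept = 15267          # reported in the text
kept_predicted_canceled = 1806       # reported in the text (ASSUMED to be this cell)
n_train = 25392                      # "No. Observations" in the model summary

# ASSUMPTION: the fourth cell is inferred as the remainder, not read from the plot.
canceled_predicted_kept = n_train - (
    canceled_predicted_canceled + kept_predicted_kept + kept_predicted_canceled
)

# Accuracy: share of all training bookings classified correctly.
accuracy = (canceled_predicted_canceled + kept_predicted_kept) / n_train

# Recall for cancellations: share of actual cancellations the model flags.
cancel_recall = canceled_predicted_canceled / (
    canceled_predicted_canceled + canceled_predicted_kept
)

# Precision for cancellations: share of predicted cancellations that are real.
cancel_precision = canceled_predicted_canceled / (
    canceled_predicted_canceled + kept_predicted_canceled
)

print(f"Accuracy:                {accuracy:.4f}")
print(f"Recall (cancellations):  {cancel_recall:.4f}")
print(f"Precision (cancellations): {cancel_precision:.4f}")
```

The accuracy computed this way agrees with the `accuracy_score` value printed in this notebook, which supports the inferred fourth cell. For a hotel, recall on cancellations may matter more than overall accuracy, since missed cancellations are the costly errors.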
# We print the accuracy of our training set.
print("Accuracy on training set : ", accuracy_score(y_train, pred_train))
Accuracy on training set : 0.8034420289855072
From this, we see that our model is roughly 80% accurate on the training data, compared with the roughly 67.2% we would get by always predicting that a booking is not cancelled. However, we should keep in mind that the variables we know aren't significant may be affecting how accurately we predict whether someone will cancel their booking.
# We remove the columns that, according to the earlier logistic model, were not significant.
X_2 = X.drop(['arrival_date', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'no_of_adults','no_of_children'], axis = 1)
# We make new training and test datasets based on the changes we made.
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
X_2, Y, test_size=0.30, random_state=1, stratify=Y
)
# We make a new logistic regression model on the reduced training set.
logit_2 = sm.Logit(y_train_2, X_train_2.astype(float))
# We fit the logistic regression model
lg_2 = logit_2.fit()
Optimization terminated successfully.
Current function value: 0.434697
Iterations 11
# We print a summary of the logistic regression model.
print(lg_2.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25379
Method: MLE Df Model: 12
Date: Fri, 17 Mar 2023 Pseudo R-squ.: 0.3127
Time: 17:35:24 Log-Likelihood: -11038.
converged: True LL-Null: -16060.
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 1064.4659 115.359 9.227 0.000 838.367 1290.565
no_of_weekend_nights -0.1488 0.019 -7.654 0.000 -0.187 -0.111
no_of_week_nights -0.0591 0.012 -4.919 0.000 -0.083 -0.036
type_of_meal_plan 0.1131 0.024 4.750 0.000 0.066 0.160
required_car_parking_space 1.4985 0.136 11.016 0.000 1.232 1.765
room_type_reserved 0.0752 0.014 5.187 0.000 0.047 0.104
lead_time -0.0142 0.000 -58.193 0.000 -0.015 -0.014
arrival_year -0.5269 0.057 -9.217 0.000 -0.639 -0.415
arrival_month 0.0396 0.006 6.267 0.000 0.027 0.052
market_segment_type 1.2217 0.042 28.788 0.000 1.139 1.305
repeated_guest 0.9243 0.349 2.647 0.008 0.240 1.609
avg_price_per_room -0.0179 0.001 -26.568 0.000 -0.019 -0.017
no_of_special_requests 1.3836 0.029 47.784 0.000 1.327 1.440
==============================================================================================
# We determine accuracy of the training dataset.
pred_train_2 = lg_2.predict(X_train_2)
pred_train_2 = np.round(pred_train_2)
# We print the accuracy of the training dataset.
print("Accuracy on training set : ", accuracy_score(y_train, pred_train_2))
Accuracy on training set : 0.8030875866414618
Based on this, we see that our accuracy on the training set is relatively high, though not significantly different from our accuracy before we made changes.
# We convert the coefficients of our logistic regression model into odds
odds = np.exp(lg_2.params)
# We add the odds to an empty dataframe.
pd.DataFrame(odds, X_train_2.columns, columns=["odds"]).T
C:\Users\Neal Daniels\Anaconda 2\lib\site-packages\pandas\core\arraylike.py:402: RuntimeWarning: overflow encountered in exp result = getattr(ufunc, method)(*inputs, **kwargs)
| | const | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | market_segment_type | repeated_guest | avg_price_per_room | no_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| odds | inf | 0.861704 | 0.942602 | 1.119776 | 4.474784 | 1.078079 | 0.985922 | 0.590438 | 1.040367 | 3.392926 | 2.520058 | 0.982211 | 3.989345 |
# Now we find the percentages of change
perc_change_odds = (np.exp(lg_2.params) - 1) * 100
# adding the change_odds% to a dataframe
pd.DataFrame(perc_change_odds, X_train_2.columns, columns=["change_odds%"]).T
C:\Users\Neal Daniels\Anaconda 2\lib\site-packages\pandas\core\arraylike.py:402: RuntimeWarning: overflow encountered in exp result = getattr(ufunc, method)(*inputs, **kwargs)
| | const | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | market_segment_type | repeated_guest | avg_price_per_room | no_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| change_odds% | inf | -13.829649 | -5.739763 | 11.977628 | 347.478446 | 7.807904 | -1.407811 | -40.956236 | 4.036707 | 239.292624 | 152.005771 | -1.778936 | 298.934532 |
Holding all other features constant, a one-unit increase in 'no_of_weekend_nights' changes the log-odds of class 1 (which this notebook treats as a booking not being cancelled) by -0.1488, which is a 13.83% decrease in the odds of the booking being kept. In other words, cancellation becomes more likely as this value grows.
Holding all other features constant, a one-unit increase in 'no_of_week_nights' changes the log-odds of a booking being kept by -0.0591, a 5.74% decrease in the odds of the booking being kept, so cancellation becomes more likely.
Holding all other features constant, a one-unit increase in the encoded meal plan ('type_of_meal_plan') changes the log-odds of a booking being kept by 0.1131, an 11.98% increase in the odds of the booking being kept, so cancellation becomes less likely.
Holding all other features constant, needing a parking space ('required_car_parking_space' going from 0 to 1) changes the log-odds of a booking being kept by 1.4985, a 347.48% increase in the odds of the booking being kept (the odds are multiplied by about 4.47), so cancellation becomes much less likely.
Holding all other features constant, a one-unit increase in the encoded room type ('room_type_reserved') changes the log-odds of a booking being kept by 0.0752, a 7.81% increase in the odds of the booking being kept.
Holding all other features constant, a one-unit increase in 'lead_time' changes the log-odds of a booking being kept by -0.0142, a 1.41% decrease in the odds of the booking being kept, so cancellation becomes more likely.
Holding all other features constant, a one-unit increase in 'arrival_year' changes the log-odds of a booking being kept by -0.5269, a 40.96% decrease in the odds of the booking being kept, so cancellation becomes more likely.
Holding all other features constant, a one-unit increase in 'arrival_month' changes the log-odds of a booking being kept by 0.0396, a 4.04% increase in the odds of the booking being kept.
Holding all other features constant, a one-unit increase in the encoded 'market_segment_type' changes the log-odds of a booking being kept by 1.2217, a 239.29% increase in the odds of the booking being kept (the odds are multiplied by about 3.39).
Holding all other features constant, being a repeat customer ('repeated_guest' going from 0 to 1) changes the log-odds of a booking being kept by 0.9243, a 152.01% increase in the odds of the booking being kept (the odds are multiplied by about 2.52).
Holding all other features constant, a one-unit increase in the average price per room changes the log-odds of a booking being kept by -0.0179, a 1.78% decrease in the odds of the booking being kept, so cancellation becomes more likely.
Holding all other features constant, one additional special request changes the log-odds of a booking being kept by 1.3836, a 298.93% increase in the odds of the booking being kept (the odds are multiplied by about 3.99), so cancellation becomes much less likely.
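The percentages in the table above come from exponentiating each coefficient. As a minimal sketch of that arithmetic, the snippet below recomputes it for three coefficients copied from the summary table, and also shows the reciprocal view for the other class. The names `coefs` and `odds_summary` are hypothetical helpers introduced only for this illustration, and the coefficients are the rounded values printed in the summary, so the results can differ from the table in the last decimals.

```python
import math

# Coefficients copied (rounded) from the Logit summary above.
# They describe the log-odds of class 1.
coefs = {
    "no_of_weekend_nights": -0.1488,
    "required_car_parking_space": 1.4985,
    "lead_time": -0.0142,
}


def odds_summary(coef):
    """Return (odds ratio, % change in odds of class 1, % change in odds of class 0)."""
    odds_ratio = math.exp(coef)
    pct_class_1 = (odds_ratio - 1) * 100
    # The odds of class 0 are the reciprocal of the odds of class 1,
    # so their ratio is exp(-coef).
    pct_class_0 = (math.exp(-coef) - 1) * 100
    return odds_ratio, pct_class_1, pct_class_0


for name, coef in coefs.items():
    ratio, pct_1, pct_0 = odds_summary(coef)
    print(f"{name}: odds ratio {ratio:.4f}, class 1 {pct_1:+.2f}%, class 0 {pct_0:+.2f}%")
```

Note that a percentage change in odds can exceed +100% but can never fall below -100%, because an odds ratio is always positive.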
# We make another confusion matrix, this time looking at the new logistic regression model.
cm = confusion_matrix(y_train, pred_train_2)
# Specify the size of the graph
plt.figure(figsize=(7, 5))
# We make a display of the confusion matrix by using the heatmap.
sns.heatmap(cm, annot=True, fmt="g")
# We label the x axis with "Predicted Values"
plt.xlabel("Predicted Values")
# Label the y-axis with "Actual Values"
plt.ylabel("Actual Values")
# Displays the graph with no coding messages.
plt.show()
From this graph, we see that 5140 customers were predicted to cancel their bookings and did so, while 3179 customers cancelled even though we predicted they would not. We also see that 15252 customers were predicted not to cancel and indeed kept their bookings, while 1821 customers were predicted to cancel but did not.
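As a quick check, the four counts quoted above can be turned back into the accuracy we printed earlier, along with the recall for each class. This is only a sketch of the arithmetic: the variable names are hypothetical, and it assumes that class 1 means "not cancelled", as the threshold comment later in this notebook states.

```python
# Counts quoted in the paragraph above (training set).
true_cancel = 5140    # predicted to cancel, did cancel
missed_cancel = 3179  # predicted to keep, did cancel
true_keep = 15252     # predicted to keep, did keep
false_alarm = 1821    # predicted to cancel, did keep

total = true_cancel + missed_cancel + true_keep + false_alarm

# Accuracy: share of bookings whose prediction matched the outcome.
accuracy = (true_cancel + true_keep) / total
# Recall for cancellations: share of actual cancellations that were caught.
cancel_recall = true_cancel / (true_cancel + missed_cancel)
# Recall for kept bookings: share of kept bookings predicted as kept.
keep_recall = true_keep / (true_keep + false_alarm)

print(f"total={total}, accuracy={accuracy:.4f}")
print(f"cancel recall={cancel_recall:.4f}, keep recall={keep_recall:.4f}")
```

The accuracy matches the training accuracy printed above, while the recall for cancellations is noticeably lower than the recall for kept bookings, which accuracy alone does not reveal.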
# We then plot a ROC curve on the training set, as accuracy alone does not tell us how well
# the model's predicted probabilities separate cancelled bookings from kept ones.
logit_roc_auc_train = roc_auc_score(y_train, lg_2.predict(X_train_2))
fpr, tpr, thresholds = roc_curve(y_train, lg_2.predict(X_train_2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
From this ROC Curve, we see that the area under the curve is 0.86. This means that the model is performing well on the training set.
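One way to read an area of 0.86 is as the share of (positive, negative) pairs in which the positive example receives the higher predicted probability. The sketch below computes the area that way on a tiny made-up data set. The function `pairwise_auc` and the toy values are hypothetical and are not taken from the hotel data, and this brute-force loop is meant for intuition rather than as a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for illustration only.
toy_labels = [1, 1, 1, 0, 0, 1, 0, 1]
toy_scores = [0.9, 0.8, 0.35, 0.4, 0.2, 0.7, 0.6, 0.55]

print(round(pairwise_auc(toy_labels, toy_scores), 3))
```

A value of 0.5 would mean the model ranks pairs no better than chance, and 1.0 would mean every positive is ranked above every negative.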
#Now, we use the model to make predictions on the test data.
# We first drop the same columns we dropped in the X_train_2 columns.
X_test_2 = X_test.drop(['arrival_date', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'no_of_adults','no_of_children'], axis=1)
# We classify a booking as 1 (not cancelling) when its predicted probability
# on the test data set is greater than 0.5, and as 0 otherwise.
# We store the result in 'pred_test_3'.
pred_test_3 = lg_2.predict(X_test_2) > 0.5
# We convert the boolean values into 0/1 integers.
pred_test_3 = pred_test_3.astype(int)
# We print the accuracy of the model on the training data set
print("Accuracy on training set : ", accuracy_score(y_train, pred_train_2))
# We print the accuracy of the model on the test data set
print("Accuracy on test set : ", accuracy_score(y_test, pred_test_3))
Accuracy on training set : 0.8030875866414618
Accuracy on test set : 0.8050169989892493
The accuracy on the training and test data sets is nearly the same (0.803 versus 0.805), which suggests that our model is not overfitting.
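The 0.5 cut-off used above is a choice rather than a property of the model. As a small sketch of how the cut-off changes the predicted classes, the snippet below applies three thresholds to made-up probabilities. The helpers `classify` and `accuracy` and the toy values are hypothetical and only illustrate the mechanics.

```python
def classify(probabilities, threshold=0.5):
    """Label a booking 1 when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical probabilities and labels for illustration only.
toy_probs = [0.95, 0.62, 0.48, 0.51, 0.30, 0.77, 0.45, 0.10]
toy_truth = [1, 1, 1, 0, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    preds = classify(toy_probs, threshold)
    print(threshold, preds, accuracy(toy_truth, preds))
```

Lowering the threshold labels more bookings as class 1 and raising it labels fewer, so the balance between the two kinds of error shifts even though the fitted model stays the same.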
# This code is used to make a decision tree with the criterion set to 'gini' as we want to use the Gini Index in
# making a classifying decision tree. We also set the max depth of the decision tree to 5, to make it so we can
# see what it tells us, without overfitting.
dTree = DecisionTreeClassifier(criterion = 'gini', max_depth = 5, random_state=1)
# We then fit the decision tree model with the X_train and y_train datasets.
dTree.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, random_state=1)
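Since we chose the 'gini' criterion, it may help to see what the tree is minimizing at each split. The sketch below computes the Gini impurity of a node from its class counts and the weighted impurity of a candidate split. The function names and the counts are hypothetical, and this is a simplified illustration of the criterion rather than scikit-learn's internal implementation.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_split_impurity(left_counts, right_counts):
    """Impurity of a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini_impurity(left_counts) + (
        n_right / n
    ) * gini_impurity(right_counts)


# Hypothetical class counts [class 0, class 1] for a parent node and one split.
parent = [40, 60]
left, right = [30, 10], [10, 50]

print(gini_impurity(parent), weighted_split_impurity(left, right))
```

A split is attractive when the weighted impurity of the children is lower than the impurity of the parent, and a pure node (all one class) has an impurity of zero.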
# We look at the accuracy of the decision tree using the training dataset.
print("Accuracy on training set : ",dTree.score(X_train, y_train))
# We also look at the accuracy of the decision tree using the test dataset.
print("Accuracy on test set : ",dTree.score(X_test, y_test))
Accuracy on training set : 0.837822936357908
Accuracy on test set : 0.8327667003583571
Admittedly, the numbers are not as high as we would like, but considering that this is a first attempt, the accuracy of the decision tree is reasonably high (about 0.84 on the training set and 0.83 on the test set) and approximately the same on both data sets. Therefore, our decision tree model does not appear to be overfitting.
#We check for the number of positives
Y.sum(axis = 0)
24390
We have 24390 positives out of 36275 (I used Y.shape to get that number), which means that the majority of the bookings (roughly 67%) are positive, that is, not cancelled. This aligns with our earlier analysis that only a minority of customers (about a third) cancel their bookings.
# We create a 'make_confusion_matrix' function that we can later use to create
# a confusion matrix.
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
                         columns = [i for i in ['Predicted - No','Predicted - Yes']])
    group_counts = ["{0:0.0f}".format(value) for value in
                    cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
## Function to calculate recall score
def get_recall_score(model):
    '''
    model : classifier to predict values of X
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
    print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
# We make a confusion matrix, based on the y_test
make_confusion_matrix(dTree,y_test)
From this graph, we see that approximately 21.75% of the test bookings were predicted to cancel and did cancel. We also see that 5.71% were predicted to cancel but kept their booking, and that 11.02% were predicted to keep their booking but cancelled. Finally, 61.53% were predicted to keep their booking and did so. Together, 67.24% of the bookings in the test set were actually kept and 32.77% were actually cancelled; these are the observed shares, not the decision tree's predictions.
# We then look at the recall of the decision tree to help us better analyze its performance.
# We do this for both the training data set and the test data set.
get_recall_score(dTree)
Recall on training set : 0.9218063609207521
Recall on test set : 0.915129151291513
# We make a variable named "feature_names" which takes all the column names of the "X" variable.
feature_names = list(X.columns)
# We print the list contained in the "feature_names" column.
print(feature_names)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'lead_time', 'arrival_year', 'arrival_month', 'arrival_date', 'market_segment_type', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']
# We specify the size of the decision tree graph.
plt.figure(figsize=(20,30))
# We then plot the decision tree, with the features being labeled using the 'feature_names' variable created earlier.
tree.plot_tree(dTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
# We display the graph with no coding messages.
plt.show()
# Admittedly, the graph above looks a little complex and difficult to read. So, we can look at the text of the tree
# and then see how the tree branches out this way.
print(tree.export_text(dTree,feature_names=feature_names,show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type <= 1.50
|   |   |   |--- lead_time <= 20.50
|   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |--- weights: [100.00, 622.00] class: 1
|   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |--- weights: [406.00, 686.00] class: 1
|   |   |   |--- lead_time > 20.50
|   |   |   |   |--- avg_price_per_room <= 74.69
|   |   |   |   |   |--- weights: [143.00, 223.00] class: 1
|   |   |   |   |--- avg_price_per_room > 74.69
|   |   |   |   |   |--- weights: [2103.00, 939.00] class: 0
|   |   |--- market_segment_type > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- weights: [93.00, 2435.00] class: 1
|   |   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |   |--- weights: [268.00, 1449.00] class: 1
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- weights: [345.00, 388.00] class: 1
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- weights: [76.00, 405.00] class: 1
|   |--- no_of_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type <= 1.50
|   |   |   |   |--- lead_time <= 9.50
|   |   |   |   |   |--- weights: [76.00, 1075.00] class: 1
|   |   |   |   |--- lead_time > 9.50
|   |   |   |   |   |--- weights: [928.00, 3442.00] class: 1
|   |   |   |--- market_segment_type > 1.50
|   |   |   |   |--- no_of_weekend_nights <= 2.50
|   |   |   |   |   |--- weights: [21.00, 1058.00] class: 1
|   |   |   |   |--- no_of_weekend_nights > 2.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 89.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [0.00, 2185.00] class: 1
|   |   |   |   |--- no_of_week_nights > 3.50
|   |   |   |   |   |--- weights: [45.00, 301.00] class: 1
|   |   |   |--- lead_time > 89.50
|   |   |   |   |--- avg_price_per_room <= 202.14
|   |   |   |   |   |--- weights: [107.00, 482.00] class: 1
|   |   |   |   |--- avg_price_per_room > 202.14
|   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- market_segment_type <= 1.50
|   |   |   |   |--- avg_price_per_room <= 8.00
|   |   |   |   |   |--- weights: [4.00, 8.00] class: 1
|   |   |   |   |--- avg_price_per_room > 8.00
|   |   |   |   |   |--- weights: [636.00, 6.00] class: 0
|   |   |   |--- market_segment_type > 1.50
|   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |--- weights: [57.00, 357.00] class: 1
|   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |--- weights: [578.00, 335.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- weights: [9.00, 59.00] class: 1
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- weights: [125.00, 47.00] class: 0
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- market_segment_type <= 1.50
|   |   |   |   |   |--- weights: [101.00, 322.00] class: 1
|   |   |   |   |--- market_segment_type > 1.50
|   |   |   |   |   |--- weights: [3.00, 153.00] class: 1
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [2068.00, 0.00] class: 0
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- arrival_month > 11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [0.00, 49.00] class: 1
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- room_type_reserved <= 1.50
|   |   |   |   |   |--- weights: [1.00, 6.00] class: 1
|   |   |   |   |--- room_type_reserved > 1.50
|   |   |   |   |   |--- weights: [17.00, 8.00] class: 0
# We print a table with the magnitude of the feature importances, which shows how much
# weight the model has given to each feature.
# We name the column "Importance", use the column names of the X_train
# data set as the index, and sort the values in descending order.
print (pd.DataFrame(dTree.feature_importances_, columns = ["Importance"], index = X_train.columns).sort_values(by = 'Importance', ascending = False))
                                      Importance
lead_time                               0.500657
market_segment_type                     0.203872
no_of_special_requests                  0.136736
avg_price_per_room                      0.096339
no_of_adults                            0.026819
arrival_month                           0.017377
no_of_weekend_nights                    0.015656
no_of_week_nights                       0.001938
room_type_reserved                      0.000605
no_of_previous_bookings_not_canceled    0.000000
no_of_previous_cancellations            0.000000
repeated_guest                          0.000000
const                                   0.000000
arrival_date                            0.000000
required_car_parking_space              0.000000
type_of_meal_plan                       0.000000
no_of_children                          0.000000
arrival_year                            0.000000
The first thing we see is that 'lead_time' and 'market_segment_type' have the highest values, followed by 'no_of_special_requests'. That gives us the top three variables that have the largest impact on whether someone cancels their booking or not.
# Admittedly, the numbers in the table above are hard to compare at a glance.
# So, we're going to graph the importance of the features given to us by the model.
# We assign the array of the importances to the variable 'importances'
importances = dTree.feature_importances_
# We sort the variable and name it 'indices'
indices = np.argsort(importances)
# We specify the size of the graph
plt.figure(figsize=(10,10))
# We have the title of the graph be "Feature Importances"
plt.title('Feature Importances')
# We create bar plot horizontally, with a lime color and have it centralized.
plt.barh(range(len(indices)), importances[indices], color='lime', align='center')
# We have the y-axis marks be the values given under 'indices'.
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
# We have the x-axis labeled "Relative Importance"
plt.xlabel('Relative Importance')
# We show the graph with no coding messages.
plt.show()
# We define a function to compute different metrics to check the performance of a
# classification model built using sklearn.
def model_performance_classification_sklearn(model, predictors, target):
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
# We make a variable named 'decision_tree_perf_train' which takes the result of using the
#'model_performance_classification_sklearn()' function to compute metrics of the model on the training dataset.
decision_tree_perf_train = model_performance_classification_sklearn(
dTree, X_train, y_train
)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.837823 | 0.921806 | 0.849738 | 0.884306 |
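The F1 value in this table can be reproduced from the precision and recall next to it, since F1 is their harmonic mean. The helper name `f1_from` is hypothetical; the two inputs are copied from the row above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the training row above.
train_f1 = f1_from(precision=0.849738, recall=0.921806)
print(round(train_f1, 6))
```

Because it is a harmonic mean, F1 sits closer to the lower of the two inputs, so it falls quickly when either precision or recall is poor.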
# We create a function to make a confusion matrix with percentages.
# We call this function 'confusion_matrix_sklearn'.
def confusion_matrix_sklearn(model, predictors, target):
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# We plot the confusion matrix of the initial decision tree (dTree) on the training data.
confusion_matrix_sklearn(dTree, X_train, y_train)
# We can see that certain variables have a greater level of importance than others,
# but we need to see how the model can be improved
# without possibly stifling it.
# Our first approach is the GridSearch method, which searches for the best combination of hyperparameters
# from a predefined set of values and is a form of pre-pruning.
# Our first step is to choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
#We create a grid of parameters to choose from
parameters = {
"max_depth": [2, 4, 50, 10, 20],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}
# We create a variable named 'acc_scorer', which (despite its name) scores by recall
# and is used to compare parameter combinations.
acc_scorer = make_scorer(recall_score)
# We run the grid search with Cross Validation
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=13)
# We use the prior variable created, 'grid_obj' and fit it to the X_train
# and y_train datasets.
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=4,
min_impurity_decrease=1e-06, random_state=1,
splitter='random')
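To get a feel for how much work the grid search above does, we can enumerate the grid ourselves. The sketch below copies the `parameters` dictionary and the fold count from the cells above and counts the combinations; the names `combinations`, `n_folds`, and `total_fits` are hypothetical and introduced only for this illustration.

```python
from itertools import product

# Grid copied from the 'parameters' dictionary above.
parameters = {
    "max_depth": [2, 4, 50, 10, 20],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}

# Every combination of one value per hyperparameter.
combinations = [
    dict(zip(parameters, values)) for values in product(*parameters.values())
]

n_folds = 13  # the 'cv' value passed to GridSearchCV above
total_fits = len(combinations) * n_folds

print(len(combinations), "combinations,", total_fits, "cross-validation fits")
print(combinations[0])
```

With its default `refit=True`, `GridSearchCV` also refits the best combination once on the full training data, so `best_estimator_` is already fitted before the explicit `estimator.fit` call above.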
confusion_matrix_sklearn(dTree, X_train, y_train)
Looking at the confusion matrix of the training data set for the initial tree (dTree, which is the model passed to the function above), we see that 21.80% of bookings were predicted to cancel and did so. A further 10.96% cancelled even though they were predicted not to. We also see that 61.98% of the train data kept their bookings as predicted, while an additional 5.26% were predicted to cancel their booking but didn't.
# We now check the performance on the training dataset.
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.765517 | 0.937152 | 0.766247 | 0.843126 |
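We see that recall on the training data is high for the tuned tree (0.937), but its accuracy (0.766) and precision (0.766) are lower than those of the initial tree (0.838 and 0.850), which is understandable since the grid search optimized recall alone. Next, we check how the initial tree (dTree) fares on the test data set.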
decision_tree_perf_test = model_performance_classification_sklearn(
dTree, X_test, y_test
)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.832767 | 0.915129 | 0.848132 | 0.880358 |
In this case, we see that the accuracy of the initial decision tree has dropped slightly compared with the training set (0.838 to 0.833) but, fortunately, it's still relatively high. Recall (0.922 to 0.915) and precision (0.850 to 0.848) also dropped slightly but remain relatively high. The F1 score changed as well (0.884 to 0.880), which is understandable since it is calculated from precision and recall, and it is still high.
# We make a confusion matrix with percentages using the 'confusion_matrix_sklearn'
# function to see the numbers.
confusion_matrix_sklearn(dTree, X_test, y_test)
We see that, when the decision tree model is applied to the test dataset, 21.75% of the customers were predicted to cancel their bookings and did so. We also see that 5.71% were predicted to cancel their bookings but didn't, for one reason or another. An additional 11.02% were predicted not to cancel but did anyway. Finally, roughly 61.53% were predicted not to cancel their bookings and kept them.
Comparing this confusion matrix with the one created for the train data, we see only marginal changes in each category. This further validates that our decision tree model is not overfitting.
# We now see whether our decision tree model has become simpler
# using the GridSearch method.
tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
plt.show()
Unfortunately, the tree we see is harder to make out than the original one. We'll try another method called Cost Complexity Pruning.
# We now use the Cost Complexity method, and label the variable containing
# the method 'clf'
clf = DecisionTreeClassifier(random_state=1)
# We then make a new variable named 'path', which looks at the X_train and
# y_train dataset through the Cost Complexity function
path = clf.cost_complexity_pruning_path(X_train, y_train)
# We use these variables to perform post-pruning
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# We make a data frame, which looks at the ccp_alpha values and their
# impurities.
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | 0.006842 |
| 1 | 0.000000e+00 | 0.006842 |
| 2 | 6.352013e-07 | 0.006843 |
| 3 | 8.078458e-07 | 0.006843 |
| 4 | 1.125214e-06 | 0.006844 |
| ... | ... | ... |
| 1321 | 7.158311e-03 | 0.285860 |
| 1322 | 1.347162e-02 | 0.299331 |
| 1323 | 1.668747e-02 | 0.316019 |
| 1324 | 2.423451e-02 | 0.364488 |
| 1325 | 7.608448e-02 | 0.440572 |
1326 rows × 2 columns
# Now, we graph the total impurity of the decision tree leaves and effective alpha for the training data set.
#We specify that the graph is to be a certain size.
fig, ax = plt.subplots(figsize=(15, 5))
# We have the alpha and impurities in a graph, using the "steps_post" method
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
# We label the x-axis "Effective Alpha"
ax.set_xlabel("Effective Alpha")
# We label the y-axis "Total Impurity of Leaves"
ax.set_ylabel("Total Impurity of Leaves")
# We label the title of the plot "Total Impurity vs Effective Alpha for Training set"
ax.set_title("Total Impurity vs Effective Alpha for Training set")
# We show the graph with no coding messages.
plt.show()
# We make an empty list named 'clfs'
clfs = []
# We run a for loop, which trains a decision tree for each effective alpha.
# The last value in 'ccp_alphas' is the alpha value that prunes the entire
# decision tree, leaving the last tree (clfs[-1]) with a single node.
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.07608448246807803
# We then remove the last element in clfs
clfs = clfs[:-1]
# Then we remove the last element in 'ccp_alphas'. We remove these values because
# it is the trivial tree with only one node. Then we show the number of nodes and
# tree depth.
ccp_alphas = ccp_alphas[:-1]
# We then make a variable 'node_counts' which holds the number of nodes in each remaining tree
node_counts = [clf.tree_.node_count for clf in clfs]
# We then create a variable 'depth' which takes the depth of the pruned decision tree
depth = [clf.tree_.max_depth for clf in clfs]
# We specify the size of the graph
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
# We plot the graph, using the ccp_alphas and 'node_counts' variables.
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
# We label the x-axis 'alpha'
ax[0].set_xlabel("alpha")
# Label the y-axis "number of nodes"
ax[0].set_ylabel("number of nodes")
# Label the title of the graph "Number of nodes vs alpha"
ax[0].set_title("Number of nodes vs alpha")
# We create another graph, looking at the 'ccp_alphas' and 'depth' variables
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
# We label the x-axis 'alpha'
ax[1].set_xlabel("alpha")
# Label the y-axis "depth of tree"
ax[1].set_ylabel("depth of tree")
# Label the graph "Depth vs alpha"
ax[1].set_title("Depth vs alpha")
# We specify the layout of the graphs.
fig.tight_layout()
The above graphs confirm that the number of nodes and the tree depth decrease as alpha increases.
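The reason larger alphas give smaller trees is that cost complexity pruning scores a tree as its total leaf impurity plus alpha times its number of leaves. The sketch below works through that trade-off for one subtree using made-up numbers; the function names and the values are hypothetical, and this is a simplified illustration of the criterion rather than scikit-learn's implementation.

```python
def cost_complexity(leaf_impurity, n_leaves, alpha):
    """Cost-complexity measure: total leaf impurity plus alpha per leaf."""
    return leaf_impurity + alpha * n_leaves


def effective_alpha(node_impurity, subtree_impurity, subtree_leaves):
    """Alpha at which collapsing a subtree into one leaf costs the same as keeping it."""
    return (node_impurity - subtree_impurity) / (subtree_leaves - 1)


# Hypothetical subtree: 5 leaves with total impurity 0.10, versus collapsing it
# into a single leaf with impurity 0.18.
alpha_star = effective_alpha(0.18, 0.10, 5)
print("effective alpha:", round(alpha_star, 4))

for alpha in (0.01, 0.03):
    keep = cost_complexity(0.10, 5, alpha)
    prune = cost_complexity(0.18, 1, alpha)
    decision = "prune" if prune < keep else "keep"
    print(alpha, round(keep, 4), round(prune, 4), decision)
```

Below the effective alpha the larger subtree has the lower cost and is kept; above it the single leaf wins, which is why the node count and depth shrink as alpha grows.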
Now, we're not out of the woods yet, as we're going to look at the recall on both the training and testing data sets.
# We create an empty list variable named "recall_train"
recall_train = []
# We run a for loop that gives us a list of recall values for the
# train data set, which we place
# inside the "recall_train" list.
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
# We create an empty list variable named "recall_test"
recall_test = []
# We run a for loop which gives us a list of recall values for the
# test data set, which we place
# inside the "recall_test" list.
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
# We specify the size of the following graph
fig, ax = plt.subplots(figsize=(15, 5))
# We label the x-axis with "alpha"
ax.set_xlabel("alpha")
# We label the y-axis with "Recall"
ax.set_ylabel("Recall")
# We label the graph with "Recall vs alpha for training and testing sets"
ax.set_title("Recall vs alpha for training and testing sets")
# We graph the ccp_alphas on the x-axis, with the recall_train on the y-axis.
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
# We plot the ccp_alphas on the x-axis and recall_test on the y-axis.
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
# We include a legend to help us understand the graph.
ax.legend()
# We show the graph with no coding messages.
plt.show()
We notice that the train recall stays slightly higher than the test recall as alpha increases. This is natural as our model was formed based on the train dataset. However, the good thing to notice from this graph is that the recall in either case is not significantly different, meaning that the pruned trees behave similarly on data they have not seen. We also see that the gap is largest at the smallest alpha values, where the nearly unpruned tree has effectively memorized training outcomes that it already knows.
# We select the model with the highest recall on the test data set.
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0019167262993599714, random_state=1)
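One detail worth knowing about the selection above is that `np.argmax` returns the first index when several values tie for the maximum. Since 'ccp_alphas' is sorted in increasing order, the first tied index corresponds to the least pruned of the tied trees. The sketch below shows both tie-breaking choices on made-up values; the helper names and the toy lists are hypothetical, and the pure-Python `first_argmax` only mirrors the first-occurrence behaviour for illustration.

```python
def first_argmax(values):
    """Index of the first maximum (the same tie-breaking as np.argmax)."""
    return max(range(len(values)), key=lambda i: values[i])


def last_argmax(values):
    """Index of the last maximum, i.e. the largest alpha among tied models."""
    best = max(values)
    return max(i for i, v in enumerate(values) if v == best)


# Hypothetical alphas (ascending) and test recall values for illustration only.
toy_alphas = [0.0, 0.001, 0.002, 0.004]
toy_recall = [0.90, 0.93, 0.93, 0.91]

print("first maximum at alpha", toy_alphas[first_argmax(toy_recall)])
print("last maximum at alpha", toy_alphas[last_argmax(toy_recall)])
```

Picking the last tied index would favour the simpler tree at equal recall; which choice is preferable depends on how much we value a smaller tree.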
Now, we have checked using the Recall vs Alpha on the training and testing data sets but we still need to check the model performance on both the training and test data sets.
# We make a variable 'decision_tree_postpruned_perf_train' to contain
# the metric values returned when we run the function
# model_performance_classification_sklearn() on the train dataset.
decision_tree_postpruned_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
# We then look at the metrics of the performance of the train data set.
decision_tree_postpruned_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.838059 | 0.945294 | 0.835482 | 0.887002 |
# We make a confusion matrix looking at the y_train data set.
confusion_matrix_sklearn(best_model, X_train, y_train)
From this we see that 20.25% of the train data was correctly predicted to cancel, while about 3.68% was predicted to cancel but kept the booking instead (this share is what the recall of 0.945 and the other three cells imply). We also see that 12.52% of the data was predicted to keep the booking but cancelled, while 63.56% kept the booking as predicted. Thus, while the errors are spread across both kinds of mistake, the decision tree model predicted the large majority of the data (about 84%) correctly.
# We then run the 'model_performance_classification_sklearn()' function again,
# this time to look at the y_test data and name it 'decision_tree_postpruned_perf_test'
decision_tree_postpruned_perf_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
# We look at the metrics of the y_test data set.
decision_tree_postpruned_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.830286 | 0.935356 | 0.832806 | 0.881107 |
We see that the accuracy, recall, precision, and F1 scores are remarkably but we also need to compare between this and the training data to make sure we're not overfitting or underfitting. Thankfully, in looking the two metrics, we see only marginal differences the model is fitting just fine. We also see that recall metric has improved in the test dataset.
Now that we have seen the model performance on the training and test data sets, it is worth visualizing the decision tree again.
# We specify the size of the following graph
plt.figure(figsize=(10, 10))
# We establish the variable 'out' to plot the
# 'best_model' decision tree
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
# We run the for loop to draw the arrows of the decision tree graph in black.
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
# Displays the graph with no coding messages.
plt.show()
# Admittedly, the above graph is hard to make out where certain
# nodes split or what comes after. To avoid missing this, we
# make a text report, showing the rules of the decision tree.
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type <= 1.50
|   |   |   |--- lead_time <= 20.50
|   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |--- weights: [100.00, 622.00] class: 1
|   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [406.00, 501.00] class: 1
|   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |--- weights: [0.00, 185.00] class: 1
|   |   |   |--- lead_time > 20.50
|   |   |   |   |--- avg_price_per_room <= 74.69
|   |   |   |   |   |--- weights: [143.00, 223.00] class: 1
|   |   |   |   |--- avg_price_per_room > 74.69
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2102.00, 879.00] class: 0
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [1.00, 60.00] class: 1
|   |   |--- market_segment_type > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- weights: [361.00, 3884.00] class: 1
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- weights: [345.00, 388.00] class: 1
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- weights: [76.00, 405.00] class: 1
|   |--- no_of_special_requests > 0.50
|   |   |--- weights: [1186.00, 8543.00] class: 1
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- market_segment_type <= 1.50
|   |   |   |   |--- weights: [640.00, 14.00] class: 0
|   |   |   |--- market_segment_type > 1.50
|   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |--- weights: [57.00, 357.00] class: 1
|   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |--- avg_price_per_room <= 84.62
|   |   |   |   |   |   |--- weights: [247.00, 294.00] class: 1
|   |   |   |   |   |--- avg_price_per_room > 84.62
|   |   |   |   |   |   |--- weights: [331.00, 41.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- weights: [238.00, 581.00] class: 1
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [2068.00, 0.00] class: 0
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- arrival_month > 11.50
|   |   |   |--- weights: [18.00, 63.00] class: 1
From this report, we can see four cases, out of several, in which someone may cancel their booking. With logistic regression, we can statistically calculate the probability of someone cancelling their booking. The decision tree is a step up from that, however, because it shows us, based on the data, where the break points in each variable are, which helps us identify what needs to change. We'll address these issues in our conclusion.
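Reading the rules by eye gets tedious, so one option is to pull the paths out of the text report programmatically. The sketch below is plain Python that parses a report in the `export_text` layout and lists the conditions leading to each leaf of a chosen class. The helper `rules_for_class` is hypothetical (it is not part of scikit-learn), the sample string is only the `lead_time > 151.50` branch of the report, and it assumes that class 0 marks the cases discussed here (the full report has exactly four class 0 leaves, three of them in this branch).

```python
def rules_for_class(report, target_class):
    """Return one 'cond and cond ...' string per leaf that predicts target_class."""
    path = []    # conditions on the way down to the current node
    rules = []
    for line in report.splitlines():
        if "|--- " not in line:
            continue
        prefix, _, content = line.partition("|--- ")
        depth = prefix.count("|")  # number of ancestor levels above this line
        if content.startswith("weights:"):
            predicted = content.rsplit("class:", 1)[1].strip()
            if predicted == str(target_class):
                rules.append(" and ".join(path[:depth]))
        else:
            del path[depth:]       # leave deeper or sibling branches behind
            path.append(content.strip())
    return rules


# Excerpt of the report: the branch for lead_time > 151.50 only.
REPORT_EXCERPT = """\
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- market_segment_type <= 1.50
|   |   |   |   |--- weights: [640.00, 14.00] class: 0
|   |   |   |--- market_segment_type > 1.50
|   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |--- weights: [57.00, 357.00] class: 1
|   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |--- avg_price_per_room <= 84.62
|   |   |   |   |   |   |--- weights: [247.00, 294.00] class: 1
|   |   |   |   |   |--- avg_price_per_room > 84.62
|   |   |   |   |   |   |--- weights: [331.00, 41.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- weights: [238.00, 581.00] class: 1
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [2068.00, 0.00] class: 0
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- arrival_month > 11.50
|   |   |   |--- weights: [18.00, 63.00] class: 1
"""

rules = rules_for_class(REPORT_EXCERPT, target_class=0)
for rule in rules:
    print(rule)
```

In the notebook, the same helper could be given the full string returned by `tree.export_text(best_model, feature_names=feature_names, show_weights=True)` instead of the excerpt.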
# We then look at the importances of the variables in the new decision tree
# 'best_model' to confirm which variables are significant.
print(
    pd.DataFrame(
        best_model.feature_importances_, columns=["Importance"], index=X_train.columns
    ).sort_values(by="Importance", ascending=False)
)
                                      Importance
lead_time                               0.493508
market_segment_type                     0.197838
no_of_special_requests                  0.128723
avg_price_per_room                      0.111303
arrival_month                           0.029949
no_of_adults                            0.027517
required_car_parking_space              0.011162
arrival_date                            0.000000
no_of_previous_bookings_not_canceled    0.000000
no_of_previous_cancellations            0.000000
repeated_guest                          0.000000
const                                   0.000000
room_type_reserved                      0.000000
type_of_meal_plan                       0.000000
no_of_week_nights                       0.000000
no_of_weekend_nights                    0.000000
no_of_children                          0.000000
arrival_year                            0.000000
From this, we see that the level of importance has decreased in our new decision tree, but it still confirms that 'lead_time', 'market_segment_type', and 'no_of_special_requests' are the most important variables. However, we should still consider graphing the values to see whether we're overlooking the importance of certain variables in the decision tree.
# We make a variable named 'importances' which holds the feature importances of the model.
importances = best_model.feature_importances_
indices = np.argsort(importances)
# We specify the size of the graph
plt.figure(figsize=(12, 12))
# We title the graph "Feature Importances"
plt.title("Feature Importances")
# We make horizontal bar graph, listing the variables based on values of importance
# and in the color blue.
plt.barh(range(len(indices)), importances[indices], color="blue", align="center")
# We have the marks on the y-axis be the names of the variables according to the
# level of significance.
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
# We label the x-axis "Relative Importance"
plt.xlabel("Relative Importance")
# We show the graph with no coding messages.
plt.show()
From this graph, we see that only 7 variables have a nonzero importance and are worth considering. We should keep their order of importance in mind: 'lead_time', 'market_segment_type', 'no_of_special_requests', 'avg_price_per_room', 'arrival_month', 'no_of_adults', and 'required_car_parking_space'.
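To put numbers on that ordering, the nonzero importances can be accumulated from largest to smallest. This is a small sketch in plain Python with the values copied from the printed table; the variable names `ranked` and `top3_share` are illustrative and do not appear elsewhere in the notebook.

```python
# Nonzero importances copied from the printed table (all other features are 0).
importances = {
    "lead_time": 0.493508,
    "market_segment_type": 0.197838,
    "no_of_special_requests": 0.128723,
    "avg_price_per_room": 0.111303,
    "arrival_month": 0.029949,
    "no_of_adults": 0.027517,
    "required_car_parking_space": 0.011162,
}

# Sort from most to least important and track the running total.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
cumulative = 0.0
for name, value in ranked:
    cumulative += value
    print(f"{name:<28} {value:.6f}  cumulative: {cumulative:.6f}")

# Share of total importance carried by the three leading variables.
top3_share = sum(value for _, value in ranked[:3])
print(f"Top 3 share: {top3_share:.1%}")
```

The three leading variables carry roughly 82% of the total importance, which is why the remaining four matter mostly as refinements of the main splits.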
# Now, we performed multiple methods to check the accuracy of the decision tree we created, but which do we choose?
# We're going to do a training performance comparison to see what the earlier methods tell us.
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_postpruned_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.837823 | 0.765517 | 0.838059 |
| Recall | 0.921806 | 0.937152 | 0.945294 |
| Precision | 0.849738 | 0.766247 | 0.835482 |
| F1 | 0.884306 | 0.843126 | 0.887002 |
From this, we see that the metrics we use to check the validity of our decision tree are high in all three cases. We also notice that, compared with the default sklearn tree, all the values increased after post-pruning except precision, which makes sense considering that we cut off excessive sub-trees.
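The comparison can be summarized by asking which model leads on each metric. The sketch below uses plain Python with the training scores from the table above; the dictionary layout and the name `best_per_metric` are illustrative choices rather than part of the original notebook.

```python
# Training scores copied from the comparison table.
scores = {
    "Decision Tree sklearn": {
        "Accuracy": 0.837823, "Recall": 0.921806, "Precision": 0.849738, "F1": 0.884306,
    },
    "Decision Tree (Pre-Pruning)": {
        "Accuracy": 0.765517, "Recall": 0.937152, "Precision": 0.766247, "F1": 0.843126,
    },
    "Decision Tree (Post-Pruning)": {
        "Accuracy": 0.838059, "Recall": 0.945294, "Precision": 0.835482, "F1": 0.887002,
    },
}

metrics_used = ["Accuracy", "Recall", "Precision", "F1"]

# For each metric, pick the model with the highest training score.
best_per_metric = {
    metric: max(scores, key=lambda model: scores[model][metric])
    for metric in metrics_used
}

for metric, model in best_per_metric.items():
    print(f"{metric:<10} best: {model} ({scores[model][metric]:.6f})")
```

Since these are training scores only, a final choice would normally also weigh the same comparison on the test data.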
In the logistic regression, we see which factors may encourage a customer not to cancel their booking and which have the opposite effect. We see that a customer who requires a parking space, is part of a high market segment type, is a repeat customer, and makes special requests is all the more likely to stay, while many other variables negatively impact that outcome. Using the decision tree, we know which variables have a large impact on a person cancelling their booking, and these coincide with our logistic regression.
Using the decision tree, we can make a few observations:
To cut down on these cases of cancellation, it may help to make the following adjustments.